To study and quantify human behavior using longitudinal multimodal digital data, it is essential to get to know the data well first. These data come from various sources and sensors, such as smartphones, smartwatches, and activity trackers, and therefore have different types and properties. The data may be a mixture of categorical, ordinal, and numerical variables, typically consisting of time series measured for multiple subjects from different groups. While the data is typically dense, it is also heterogeneous and contains many missing values. Therefore, the analysis has to be conducted on many different levels.
This notebook introduces the Niimpy toolbox exploration module, which seeks to address the aforementioned issues. The module provides functionalities for exploratory data analysis (EDA) of digital behavioral data. It aims to summarize the data characteristics, inspect the structures underlying the data, detect patterns and changes in those patterns, and assess the data quality (e.g., missing data, outliers). This information is essential for assessing data validity, for data filtering and selection, and for data preprocessing. The module includes functions for plotting categorical data, data counts, time series lineplots, punchcards, and for visualizing missing data.
Exploration module functions are intended to be run after data preprocessing, but they can also be run on raw observations. All functions are implemented using the Plotly open source graphing library for Python. Plotly enables interactive visualizations, which makes it easier to explore different aspects of the data (e.g., a specific time range or summary statistics).
This notebook uses several sample dataframes for module demonstration. The sample data is already preprocessed, or will be preprocessed in notebook sections before visualizations. When the sample data is loaded, some of the key characteristics of the data are displayed.
All exploration module functions require the data to follow the data schema defined in the Niimpy toolbox documentation. The user must ensure that the input data follows the specified schema.
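As a minimal illustration (assuming, as in the examples below, that the schema requires at least a timezone-aware DatetimeIndex and a `user` column; see the Niimpy documentation for the authoritative definition), a conforming dataframe could look like this:

```python
import pandas as pd

# Hypothetical minimal frame: a timezone-aware DatetimeIndex plus a
# 'user' column (see the Niimpy documentation for the full schema)
index = pd.date_range("2022-01-01", periods=4, freq="h", tz="Europe/Helsinki")
df = pd.DataFrame({
    "user": ["u01", "u01", "u02", "u02"],
    "activity": [1, 3, 0, 2],
}, index=index)

# Quick sanity checks before calling exploration functions
assert isinstance(df.index, pd.DatetimeIndex) and df.index.tz is not None
assert "user" in df.columns
```
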
%%html
<style>
table {float:left}
</style>
The following table shows accepted data types, visualization functions and the purpose of each exploration sub-module.
| Sub-module | Data type | Functions | For what |
|---|---|---|---|
| Categorical plot | Categorical | Barplot | Observation counts and distributions |
| Count plot | Categorical* / Numerical | Barplot / Boxplot | Observation counts and distributions |
| Lineplot | Numerical | Lineplot | Trend, cyclicity, patterns |
| Punchcard | Categorical* / Numerical | Heatmap | Temporal patterns of counts or values |
| Missingness | Categorical / Numerical | Barplot / Heatmap | Missing data patterns |
Data types denoted with * are not compatible with every function within the module.
This notebook uses the following definitions when referring to the data:

- user: a single study subject (e.g., u00)
- group: a collection of users sharing a property (e.g., depression symptom severity)

Here we import the modules needed for running this notebook.
import os
import sys
sys.path.append('../')
from pathlib import Path
import numpy as np
import pandas as pd
import niimpy
import plotly
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio
import warnings
from niimpy.preprocessing.survey import *
from niimpy.exploration import EDA_categorical, EDA_countplot, EDA_lineplot, EDA_missingness, EDA_punchcard, setup_dataframe
pio.templates.default = "seaborn"
px.defaults.template = "ggplot2"
px.defaults.color_continuous_scale = px.colors.sequential.RdBu
px.defaults.width = 1200
px.defaults.height = 482
warnings.filterwarnings("ignore")
This section introduces the Categorical plot module, which visualizes categorical data such as questionnaire responses.
We will demonstrate the functions using a mock survey dataframe containing questionnaire answers. The data will be preprocessed, and its basic characteristics will be summarized before the visualizations.
# go to parent directory
os.chdir('..')
# Get the current working directory
cwd = os.getcwd()
# Define path for data
data_folder = os.path.join(cwd,"niimpy","sampledata","mock-survey.csv")
# Load a mock dataframe
df = niimpy.read_csv(data_folder,tz='Europe/Helsinki')
df.head()
| user | age | gender | Little interest or pleasure in doing things. | Feeling down; depressed or hopeless. | Feeling nervous; anxious or on edge. | Not being able to stop or control worrying. | In the last month; how often have you felt that you were unable to control the important things in your life? | In the last month; how often have you felt confident about your ability to handle your personal problems? | In the last month; how often have you felt that things were going your way? | In the last month; how often have you been able to control irritations in your life? | In the last month; how often have you felt that you were on top of things? | In the last month; how often have you been angered because of things that were outside of your control? | In the last month; how often have you felt difficulties were piling up so high that you could not overcome them? | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20 | Male | several-days | more-than-half-the-days | not-at-all | nearly-every-day | almost-never | sometimes | fairly-often | never | sometimes | very-often | fairly-often |
| 1 | 2 | 32 | Male | more-than-half-the-days | more-than-half-the-days | not-at-all | several-days | never | never | very-often | sometimes | never | fairly-often | never |
| 2 | 3 | 15 | Male | more-than-half-the-days | not-at-all | several-days | not-at-all | never | very-often | very-often | fairly-often | never | never | almost-never |
| 3 | 4 | 35 | Female | not-at-all | nearly-every-day | not-at-all | several-days | very-often | fairly-often | very-often | never | sometimes | never | fairly-often |
| 4 | 5 | 23 | Male | more-than-half-the-days | not-at-all | more-than-half-the-days | several-days | almost-never | very-often | almost-never | sometimes | sometimes | very-often | never |
df.describe()
| user | age | |
|---|---|---|
| count | 1000.000000 | 1000.000000 |
| mean | 500.500000 | 26.911000 |
| std | 288.819436 | 4.992595 |
| min | 1.000000 | 12.000000 |
| 25% | 250.750000 | 23.000000 |
| 50% | 500.500000 | 27.000000 |
| 75% | 750.250000 | 30.000000 |
| max | 1000.000000 | 43.000000 |
The dataframe's columns are raw questions from a survey. Some questions belong to a specific category, so we will annotate them with ids.
The id is constructed from a prefix (the questionnaire category: GAD, PHQ, PSQI, etc.) followed by the question number (1, 2, 3, ...). Similarly, we will also map the answers to meaningful numerical values.
Note: it is important that the dataframe follows the required schema before being passed into Niimpy.
# Convert column name to id, based on provided mappers from niimpy
col_id = {**PHQ2_MAP, **PSQI_MAP, **PSS10_MAP, **PANAS_MAP, **GAD2_MAP}
selected_cols = [col for col in df.columns if col in col_id.keys()]
# Convert data frame to long format
m_df = pd.melt(df, id_vars=['user', 'age', 'gender'], value_vars=selected_cols, var_name='question', value_name='answer')
# Assign questions to codes
m_df['id'] = m_df['question'].replace(col_id)
m_df.head()
| user | age | gender | question | answer | id | |
|---|---|---|---|---|---|---|
| 0 | 1 | 20 | Male | Little interest or pleasure in doing things. | several-days | PHQ2_1 |
| 1 | 2 | 32 | Male | Little interest or pleasure in doing things. | more-than-half-the-days | PHQ2_1 |
| 2 | 3 | 15 | Male | Little interest or pleasure in doing things. | more-than-half-the-days | PHQ2_1 |
| 3 | 4 | 35 | Female | Little interest or pleasure in doing things. | not-at-all | PHQ2_1 |
| 4 | 5 | 23 | Male | Little interest or pleasure in doing things. | more-than-half-the-days | PHQ2_1 |
We can use a helper method to convert the answers into numerical values. The pre-defined mappers inside survey.py are useful for this step.
# Transform raw answers to numerical values
m_df['answer'] = niimpy.preprocessing.survey.convert_to_numerical_answer(m_df,
answer_col='answer',
question_id='id',
id_map=ID_MAP_PREFIX,
use_prefix=True)
m_df.head()
| user | age | gender | question | answer | id | |
|---|---|---|---|---|---|---|
| 0 | 1 | 20 | Male | Little interest or pleasure in doing things. | 1 | PHQ2_1 |
| 1 | 2 | 32 | Male | Little interest or pleasure in doing things. | 2 | PHQ2_1 |
| 2 | 3 | 15 | Male | Little interest or pleasure in doing things. | 2 | PHQ2_1 |
| 3 | 4 | 35 | Female | Little interest or pleasure in doing things. | 0 | PHQ2_1 |
| 4 | 5 | 23 | Male | Little interest or pleasure in doing things. | 2 | PHQ2_1 |
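For reference, the same conversion could be hand-rolled in plain pandas. The numeric scale below is inferred from the output above and is illustrative only; the mappers in survey.py (e.g., `ID_MAP_PREFIX`) remain authoritative:

```python
import pandas as pd

# Hypothetical PHQ-style answer scale, read off the output above
phq_scale = {
    "not-at-all": 0,
    "several-days": 1,
    "more-than-half-the-days": 2,
    "nearly-every-day": 3,
}

answers = pd.Series(["several-days", "not-at-all", "nearly-every-day"])
print(answers.map(phq_scale).tolist())  # [1, 0, 3]
```
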
We can also produce a summary of the questionnaire scores. This function can describe the aggregated score over the whole population or over specific subgroups.
First we'll show statistics for the whole population:
d1 = niimpy.preprocessing.survey.print_statistic(m_df)
pd.DataFrame(d1)
| PSS10 | GAD2 | PHQ2 | |
|---|---|---|---|
| min | 4.000000 | 0.000000 | 0.0000 |
| max | 27.000000 | 6.000000 | 6.0000 |
| avg | 14.006000 | 3.042000 | 3.0520 |
| std | 3.687759 | 1.536423 | 1.5855 |
Statistics grouped by gender:
d2 = niimpy.preprocessing.survey.print_statistic(m_df, group='gender')
pd.DataFrame(d2)
| PSS10 | GAD2 | PHQ2 | ||||
|---|---|---|---|---|---|---|
| Female | Male | Female | Male | Female | Male | |
| min | 4.000000 | 4.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 27.000000 | 23.000000 | 6.000000 | 6.000000 | 6.000000 | 6.000000 |
| avg | 14.059063 | 13.954813 | 3.087576 | 2.998035 | 3.067210 | 3.037328 |
| std | 3.783230 | 3.596247 | 1.585157 | 1.488141 | 1.605337 | 1.567567 |
And finally, statistics for the PHQ questionnaire by group:
d3 = niimpy.preprocessing.survey.print_statistic(m_df, group='gender', prefix='PHQ')
pd.DataFrame(d3)
| PHQ | ||
|---|---|---|
| Female | Male | |
| min | 0.000000 | 0.000000 |
| max | 6.000000 | 6.000000 |
| avg | 3.067210 | 3.037328 |
| std | 1.605337 | 1.567567 |
We can now make some plots of the preprocessed dataframe. First, we display the summary for a specific question (the first PHQ-2 question).
fig = EDA_categorical.questionnaire_summary(m_df,
question = 'PHQ2_1',
column = 'answer',
title='PHQ2 question: Little interest or pleasure in doing things',
xlabel='value',
ylabel='count',
width=600,
height=400)
fig.show()
The figure shows that the answer values (from 0 to 3) are almost uniformly distributed.
We can also display the summary for each subgroup (gender).
fig = EDA_categorical.questionnaire_grouped_summary(m_df,
question='PSS10_9',
group='gender',
title='PSS10_9',
xlabel='score',
ylabel='count',
width=800,
height=400)
fig.show()
The figure shows that the differences between the subgroups are small.
With some quick preprocessing, we can display the score distribution of each questionnaire.
We'll extract PSS-10 questionnaire answers from the dataframe.
pss_sum_df = m_df[m_df['id'].str.startswith('PSS')] \
.groupby(['user', 'gender']) \
    .agg({'answer': 'sum'}) \
.reset_index()
pss_sum_df['id'] = 'PSS'
We'll quickly inspect the preprocessed dataframe.
pss_sum_df
| user | gender | answer | id | |
|---|---|---|---|---|
| 0 | 1 | Male | 15 | PSS |
| 1 | 2 | Male | 9 | PSS |
| 2 | 3 | Male | 12 | PSS |
| 3 | 4 | Female | 16 | PSS |
| 4 | 5 | Male | 14 | PSS |
| ... | ... | ... | ... | ... |
| 995 | 996 | Female | 17 | PSS |
| 996 | 997 | Female | 13 | PSS |
| 997 | 998 | Male | 13 | PSS |
| 998 | 999 | Male | 21 | PSS |
| 999 | 1000 | Male | 14 | PSS |
1000 rows × 4 columns
And then visualize the grouped summary score distribution.
fig = EDA_categorical.questionnaire_grouped_summary(pss_sum_df,
question='PSS',
group='gender',
title='PSS10',
xlabel='score',
ylabel='count',
width=800,
height=400)
fig.show()
The figure shows that the grouped summary score distributions are close to each other.
This section introduces the Countplot module. The module contains functions for visualizing user- and group-level observation counts (number of data points per user or group) and observation value distributions.
Observation counts use barplots for user-level and boxplots for group-level visualizations. Boxplots are also used for group-level value distributions.
The module assumes that the visualized data is numerical.
We will use a sample from the StudentLife dataset to demonstrate the module functions. The sample contains hourly aggregated activity data (values from 0 to 5) and group information based on pre- and post-study PHQ-9 test scores. Study subjects have been grouped by depression symptom severity into groups: none, mild, moderate, moderately severe, and severe. The preprocessed data sample is included in the Niimpy toolbox sampledata folder.
# Define path for data
data_path = os.path.join(cwd,"niimpy","sampledata","sl_activity.csv")
# Load data
sl = niimpy.read_csv(data_path,read_csv_options={'index_col':'timestamp'},tz='US/Eastern')
sl.index = pd.to_datetime(sl.index)
sl_loc = sl.tz_localize(None)
Before visualizations, we'll inspect the data.
sl_loc
| user | activity | group | |
|---|---|---|---|
| timestamp | |||
| 2013-03-27 06:00:00 | u00 | 2 | none |
| 2013-03-27 07:00:00 | u00 | 1 | none |
| 2013-03-27 08:00:00 | u00 | 2 | none |
| 2013-03-27 09:00:00 | u00 | 3 | none |
| 2013-03-27 10:00:00 | u00 | 4 | none |
| ... | ... | ... | ... |
| 2013-05-31 18:00:00 | u59 | 5 | mild |
| 2013-05-31 19:00:00 | u59 | 5 | mild |
| 2013-05-31 20:00:00 | u59 | 4 | mild |
| 2013-05-31 21:00:00 | u59 | 5 | mild |
| 2013-05-31 22:00:00 | u59 | 1 | mild |
55907 rows × 3 columns
sl_loc.describe()
| activity | |
|---|---|
| count | 55907.000000 |
| mean | 0.750264 |
| std | 1.298238 |
| min | 0.000000 |
| 25% | 0.000000 |
| 50% | 0.000000 |
| 75% | 1.000000 |
| max | 5.000000 |
sl_loc.group.unique()
array(['none', 'severe', 'mild', 'moderately severe', 'moderate'],
dtype=object)
First, we visualize the number of observations for each subject.
fig = EDA_countplot.EDA_countplot(sl,
fig_title='Activity event counts by user',
plot_type='count',
points='all',
aggregation='user',
user=None,
column=None,
binning=False)
fig.show()
The barplot shows that there are differences in user total activity counts. The user u24 has the lowest event count of 710 and users u02 and u59 have the highest count of 1584.
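These per-user counts can be cross-checked directly with pandas; a sketch using a tiny synthetic frame in place of the StudentLife data:

```python
import pandas as pd

# Tiny synthetic stand-in for the StudentLife activity frame
index = pd.date_range("2013-03-27", periods=6, freq="h")
sl_demo = pd.DataFrame({
    "user": ["u00", "u00", "u00", "u01", "u01", "u24"],
    "activity": [2, 1, 0, 3, 4, 1],
}, index=index)

# Number of observations per user -- the quantity the barplot shows
counts = sl_demo.groupby("user").size().sort_values()
print(counts.to_dict())  # {'u24': 1, 'u01': 2, 'u00': 3}
```
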
Next, we'll inspect group-level activity event counts aggregated by day. For improved clarity, we select a time range of one week from the data.
sl_one_week = sl_loc.loc['2013-03-28':'2013-4-3']
fig = EDA_countplot.EDA_countplot(sl_one_week,
fig_title='Group level activity event count boxplots by day',
plot_type='value',
points='all',
aggregation='group',
user=None,
column='activity',
binning='D')
fig.show()
The boxplot shows some variability in the group-level event count distributions across the days from Mar 28 to Apr 3, 2013.
Finally we visualize group level activity value distributions.
fig = EDA_countplot.EDA_countplot(sl,
fig_title='Group level activity score distributions',
plot_type='value',
points='outliers',
aggregation='group',
user=None,
column='activity',
binning=False)
fig.show()
The boxplot shows that activity score distribution for groups mild and moderately severe differ from the rest.
This section introduces the Lineplot module functions. We use the same StudentLife activity data as in the previous section.
Lineplot functions display numerical feature values on a time axis. The user can optionally resample (downsample) and smooth the data for improved visual clarity.
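Under the hood, this kind of resampling and smoothing boils down to standard pandas operations. A minimal sketch (not Niimpy's actual implementation) using a synthetic signal:

```python
import numpy as np
import pandas as pd

# Synthetic minute-level signal standing in for raw activity data
index = pd.date_range("2013-04-01", periods=48 * 60, freq="min")
raw = pd.Series(np.tile([0, 1, 2, 3], 48 * 15), index=index)

hourly = raw.resample("h").mean()            # downsample to hourly means
smoothed = hourly.rolling(window=24).mean()  # 24-hour rolling-window smoothing

print(hourly.head(3))
```
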
First, we'll visualize a single user's data for a single feature, without resampling or smoothing.
fig = EDA_lineplot.timeplot(sl_loc,
users=['u01'],
columns=['activity'],
title='User {} activity'.format('u01'),
xlabel='Date',
ylabel='Value',
resample=False,
interpolate=False,
window=1,
reset_index=False)
fig.show()
The figure showing all the activity data points is difficult to interpret. By zooming in on the time range, daily patterns become apparent: there is no or low activity during the night.
Next, we'll visualize the same data resampled by hour and smoothed with a 24-hour rolling window for improved clarity. We also reset the index, so the x-axis now shows hours from the first activity observation.
fig = EDA_lineplot.timeplot(sl_loc,
users=['u00'],
columns=['activity'],
title='User activity',
xlabel='Date',
ylabel='Value',
resample='H',
interpolate=True,
window=24,
reset_index=True)
fig.show()
By zooming in on the smoothed lineplot, daily activity patterns are easier to detect.
The next visualization uses resampling by day and 7-day rolling window smoothing, making the trend of the activity time series visible.
fig = EDA_lineplot.timeplot(sl_loc,
users=['u00'],
columns=['activity'],
title='User Activity' ,
xlabel='Date',
ylabel='Value',
resample='D',
interpolate=True,
window=7)
fig.show()
Daily aggregation and smoothing make the user's activity trend visible: there is a peak at May 9 and a trough at May 23.
The following visualization superimposes two subjects' activity on the same figure.
fig = EDA_lineplot.timeplot(sl_loc,
users=['u00','u01'],
columns=['activity'],
title='User activity',
xlabel='Date',
ylabel='Value',
resample='D',
interpolate=True,
window=7)
fig.show()
The figure shows that the users' daily averaged activity is quite similar at the beginning of the inspected time range. In the first two weeks of May, the activities show opposing trends (user u00's activity increases while user u01's decreases).
Next, we'll compare group-level average activity by hour of day.
fig = EDA_lineplot.timeplot(sl_loc,
users='Group',
columns=['activity'],
title='User Activity',
xlabel='Date',
ylabel='Value',
resample='D',
interpolate=True,
window=7,
reset_index=False,
by='hour')
fig.show()
The time plot reveals that the hourly averaged group-level activity follows a circadian rhythm (less activity during the night). The moderately severe group seems to be the least active during the latter half of the day.
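The hourly group averages behind this kind of plot can be computed directly with pandas; a sketch using a small synthetic frame in place of the StudentLife data:

```python
import pandas as pd

# Two users in the same group, observed over the same four hours
idx = pd.date_range("2013-04-01", periods=4, freq="h")
demo = pd.DataFrame({
    "user": ["u00"] * 4 + ["u59"] * 4,
    "activity": [0, 2, 4, 6, 2, 4, 6, 8],
    "group": ["mild"] * 8,
}, index=idx.append(idx))

# Average activity per group for each hour of the day
hourly_means = demo.groupby([demo.index.hour, "group"])["activity"].mean()
print(hourly_means)
```
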
And finally, we compare group-level average activity by weekday.
fig = EDA_lineplot.timeplot(sl_loc,
users='Group',
columns=['activity'],
title='User Activity',
xlabel='Date',
ylabel='Value',
resample='D',
interpolate=True,
window=7,
reset_index=False,
by='weekday')
fig.show()
The timeplot shows some differences between the average group-level activity, e.g., the mild group is more active than the moderately severe group. Additionally, activity on Sundays is lower than on weekdays.
This section introduces Punchcard module functions. The functions aggregate the data and show the averaged value for each timepoint.
We use the same StudentLife dataset derived activity data as in two previous sections.
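Conceptually, a punchcard is a pivot of resampled per-user aggregates into a time × user grid; a plain-pandas sketch (not Niimpy's actual implementation) with synthetic data:

```python
import pandas as pd

# Two users with interleaved observations over two days
index = pd.date_range("2013-04-01", periods=8, freq="6h")
demo = pd.DataFrame({
    "user": ["u00", "u01"] * 4,
    "activity": [1, 0, 3, 2, 4, 2, 6, 4],
}, index=index)

# Daily mean per user, pivoted into a day x user grid -- the heatmap input
daily = (demo.groupby("user")["activity"]
             .resample("D").mean()
             .unstack(level=0))
print(daily)
```
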
First, we visualize daily aggregated mean activity for a single subject. We'll change the plot color scale to grayscale for improved clarity.
px.defaults.color_continuous_scale = px.colors.sequential.gray
EDA_punchcard.punchcard_plot(sl,
user_list=['u00'],
columns=['activity'],
title="User {} activity punchcard".format('u00'),
resample='D',
normalize=False,
agg_func=np.mean,
timerange=False)
The punchcard reveals that May 5th has the highest average activity, while May 18th, 20th, and 21st have the lowest.
Next, we'll visualize mean activity for multiple subjects.
EDA_punchcard.punchcard_plot(sl,
user_list=['u00','u01','u02'],
columns=['activity'],
title="Users {}, {}, and {} activity punchcard".format('u00','u01','u02'),
resample='D',
normalize=False,
agg_func=np.mean,
timerange=False)
The punchcard allows comparison of daily average activity across multiple subjects. There seems to be no evident common pattern in the activity.
Lastly, we'll visualize daily aggregated single-user activity side by side with the previous week's activity.
We start by shifting the activity by one week and adding it to the original dataframe.
# The data is sampled hourly, so one week corresponds to 7 * 24 rows
sl_loc['previous_week_activity'] = sl_loc['activity'].shift(periods=7 * 24, fill_value=0)
EDA_punchcard.punchcard_plot(sl_loc,
user_list=['u00'],
columns=['activity','previous_week_activity'],
title="User {} activity and previous week activity punchcard".format('u00'),
resample='D',
normalize=False,
agg_func=np.mean,
timerange=False)
The punchcard shows weekly repeating patterns in the subject's activity.
This section introduces the Missingness module for missing data inspection. The module features visualizations of data missingness by frequency and by timepoint.
Additionally, it offers an option for visualizing missing-data correlations.
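The underlying quantity in these frequency plots is simply the fraction of non-missing observations, which can be sketched in plain pandas (synthetic data; not the module's actual implementation):

```python
import numpy as np
import pandas as pd

index = pd.date_range("2022-01-01", periods=6, freq="10min")
df_demo = pd.DataFrame({
    "User_1": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "User_2": [np.nan, 2.0, 3.0, 4.0, 5.0, 6.0],
}, index=index)

# Fraction of non-missing values per column (one bar per user)
freq_per_user = df_demo.notna().mean()

# Fraction of non-missing values per 30-minute bin, averaged over users
freq_per_bin = df_demo.notna().resample("30min").mean().mean(axis=1)
print(freq_per_user)
```
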
For the data missingness visualizations, we'll create a mock dataframe with missing values using the setup_dataframe.create_missing_dataframe function.
df_m = setup_dataframe.create_missing_dataframe(nrows=2*24*14, ncols=5, density=0.7, index_type='dt', freq='10T')
df_m.columns = ['User_1','User_2','User_3','User_4','User_5',]
We will quickly inspect the dataframe before the visualizations.
df_m
| User_1 | User_2 | User_3 | User_4 | User_5 | |
|---|---|---|---|---|---|
| 2022-01-01 00:00:00 | NaN | 14.943041 | 83.314725 | 47.333823 | NaN |
| 2022-01-01 00:10:00 | 21.950740 | 41.713543 | 33.220182 | NaN | 97.127834 |
| 2022-01-01 00:20:00 | 70.238813 | NaN | 38.834846 | 16.359167 | NaN |
| 2022-01-01 00:30:00 | NaN | 52.932116 | 38.032650 | NaN | 9.109500 |
| 2022-01-01 00:40:00 | 35.589858 | 32.961580 | NaN | 32.635323 | 44.374635 |
| ... | ... | ... | ... | ... | ... |
| 2022-01-05 15:10:00 | NaN | 98.544510 | NaN | 43.479354 | 26.906122 |
| 2022-01-05 15:20:00 | 80.961953 | 28.153202 | 2.823808 | 99.339907 | 91.207567 |
| 2022-01-05 15:30:00 | 20.273946 | 72.319544 | 19.097938 | 98.149423 | 23.711486 |
| 2022-01-05 15:40:00 | NaN | NaN | 81.226172 | 51.119388 | 91.609733 |
| 2022-01-05 15:50:00 | 32.282703 | 44.677950 | 99.720209 | 44.347743 | 50.098389 |
672 rows × 5 columns
df_m.describe()
| User_1 | User_2 | User_3 | User_4 | User_5 | |
|---|---|---|---|---|---|
| count | 465.000000 | 473.000000 | 449.000000 | 481.000000 | 484.000000 |
| mean | 49.281451 | 50.887490 | 51.945671 | 47.954949 | 51.518216 |
| std | 28.774140 | 28.489196 | 28.283426 | 28.905343 | 28.805660 |
| min | 1.004597 | 1.025929 | 1.167025 | 1.199196 | 1.635838 |
| 25% | 25.184465 | 28.153202 | 29.191064 | 22.288528 | 27.629188 |
| 50% | 47.341653 | 51.026320 | 53.494208 | 45.089204 | 50.080527 |
| 75% | 74.707383 | 73.794838 | 75.450992 | 72.195600 | 77.158562 |
| max | 99.952618 | 99.991345 | 99.859195 | 99.867979 | 99.931392 |
First, we create a barplot to visualize the data frequency per column (i.e., per user).
fig = EDA_missingness.bar(df_m,
xaxis_title='User',
yaxis_title='Frequency')
fig.show()
The data frequency is fairly similar for each user, with User_5 having the highest frequency.
Next, we will show the average data frequency over time across all users.
fig = EDA_missingness.bar(df_m,
sampling_freq='30T',
xaxis_title='Time',
yaxis_title='Frequency')
fig.show()
The overall data frequency suggests no clear pattern for data missingness.
We can also create a missingness matrix visualization for the dataframe. The nullity matrix shows data missingness by timepoint.
fig = EDA_missingness.matrix(df_m,
sampling_freq='30T',
xaxis_title="User ID",
yaxis_title="Time")
fig.show()
Finally, we plot a heatmap to display the correlations between missing data.
Correlation ranges from -1 to 1:

- -1: if one variable is present, the other is almost certainly missing
- 0: the presence or absence of one variable has no effect on the presence of the other
- 1: if one variable is present, the other is almost certainly present too
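Nullity correlation can be sketched in plain pandas by correlating the boolean missingness masks (this mirrors the approach popularized by the missingno library; a toy example):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, np.nan, 3.0, np.nan],  # missing exactly when 'a' is missing
    "c": [np.nan, 2.0, np.nan, 4.0],  # missing exactly when 'a' is present
})

# Correlate the boolean missingness masks (cast to float for corr)
nullity_corr = df_demo.isna().astype(float).corr()
print(nullity_corr)
```
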
For the correlations, we use NYC collision factors sample data.
collisions = pd.read_csv("https://raw.githubusercontent.com/ResidentMario/missingno-data/master/nyc_collision_factors.csv")
First, we'll inspect the data frame.
collisions.head()
| DATE | TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | CROSS STREET NAME | OFF STREET NAME | ... | CONTRIBUTING FACTOR VEHICLE 1 | CONTRIBUTING FACTOR VEHICLE 2 | CONTRIBUTING FACTOR VEHICLE 3 | CONTRIBUTING FACTOR VEHICLE 4 | CONTRIBUTING FACTOR VEHICLE 5 | VEHICLE TYPE CODE 1 | VEHICLE TYPE CODE 2 | VEHICLE TYPE CODE 3 | VEHICLE TYPE CODE 4 | VEHICLE TYPE CODE 5 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11/10/2016 | 16:11:00 | BROOKLYN | 11208.0 | 40.662514 | -73.872007 | (40.6625139, -73.8720068) | WORTMAN AVENUE | MONTAUK AVENUE | NaN | ... | Failure to Yield Right-of-Way | Unspecified | NaN | NaN | NaN | TAXI | PASSENGER VEHICLE | NaN | NaN | NaN |
| 1 | 11/10/2016 | 05:11:00 | MANHATTAN | 10013.0 | 40.721323 | -74.008344 | (40.7213228, -74.0083444) | HUBERT STREET | HUDSON STREET | NaN | ... | Failure to Yield Right-of-Way | NaN | NaN | NaN | NaN | PASSENGER VEHICLE | NaN | NaN | NaN | NaN |
| 2 | 04/16/2016 | 09:15:00 | BROOKLYN | 11201.0 | 40.687999 | -73.997563 | (40.6879989, -73.9975625) | HENRY STREET | WARREN STREET | NaN | ... | Lost Consciousness | Lost Consciousness | NaN | NaN | NaN | PASSENGER VEHICLE | VAN | NaN | NaN | NaN |
| 3 | 04/15/2016 | 10:20:00 | QUEENS | 11375.0 | 40.719228 | -73.854542 | (40.7192276, -73.8545422) | NaN | NaN | 67-64 FLEET STREET | ... | Failure to Yield Right-of-Way | Failure to Yield Right-of-Way | Failure to Yield Right-of-Way | NaN | NaN | PASSENGER VEHICLE | PASSENGER VEHICLE | PASSENGER VEHICLE | NaN | NaN |
| 4 | 04/15/2016 | 10:35:00 | BROOKLYN | 11210.0 | 40.632147 | -73.952731 | (40.6321467, -73.9527315) | BEDFORD AVENUE | CAMPUS ROAD | NaN | ... | Failure to Yield Right-of-Way | Failure to Yield Right-of-Way | NaN | NaN | NaN | PASSENGER VEHICLE | PASSENGER VEHICLE | NaN | NaN | NaN |
5 rows × 26 columns
collisions.dtypes
DATE                              object
TIME                              object
BOROUGH                           object
ZIP CODE                         float64
LATITUDE                         float64
LONGITUDE                        float64
LOCATION                          object
ON STREET NAME                    object
CROSS STREET NAME                 object
OFF STREET NAME                   object
NUMBER OF PERSONS INJURED          int64
NUMBER OF PERSONS KILLED           int64
NUMBER OF PEDESTRIANS INJURED      int64
NUMBER OF PEDESTRIANS KILLED       int64
NUMBER OF CYCLISTS INJURED       float64
NUMBER OF CYCLISTS KILLED        float64
CONTRIBUTING FACTOR VEHICLE 1     object
CONTRIBUTING FACTOR VEHICLE 2     object
CONTRIBUTING FACTOR VEHICLE 3     object
CONTRIBUTING FACTOR VEHICLE 4     object
CONTRIBUTING FACTOR VEHICLE 5     object
VEHICLE TYPE CODE 1               object
VEHICLE TYPE CODE 2               object
VEHICLE TYPE CODE 3               object
VEHICLE TYPE CODE 4               object
VEHICLE TYPE CODE 5               object
dtype: object
We will then inspect the basic statistics.
collisions.describe()
| ZIP CODE | LATITUDE | LONGITUDE | NUMBER OF PERSONS INJURED | NUMBER OF PERSONS KILLED | NUMBER OF PEDESTRIANS INJURED | NUMBER OF PEDESTRIANS KILLED | NUMBER OF CYCLISTS INJURED | NUMBER OF CYCLISTS KILLED | |
|---|---|---|---|---|---|---|---|---|---|
| count | 6919.000000 | 7303.000000 | 7303.000000 | 7303.000000 | 7303.000000 | 7303.000000 | 7303.000000 | 0.0 | 0.0 |
| mean | 10900.746640 | 40.717653 | -73.921406 | 0.350678 | 0.000959 | 0.133644 | 0.000822 | NaN | NaN |
| std | 551.568724 | 0.069437 | 0.083317 | 0.707873 | 0.030947 | 0.362129 | 0.028653 | NaN | NaN |
| min | 10001.000000 | 40.502341 | -74.248277 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN | NaN |
| 25% | 10310.000000 | 40.670865 | -73.980744 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN | NaN |
| 50% | 11211.000000 | 40.723260 | -73.933888 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN | NaN |
| 75% | 11355.000000 | 40.759527 | -73.864463 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | NaN | NaN |
| max | 11694.000000 | 40.909628 | -73.702590 | 16.000000 | 1.000000 | 3.000000 | 1.000000 | NaN | NaN |
Finally, we will visualize the nullity correlations (how strongly the presence or absence of one variable affects the presence of another) with a heatmap and a dendrogram.
fig = EDA_missingness.heatmap(collisions)
fig.show()
The nullity heatmap and dendrogram reveal the correlation structure of the missing data; e.g., the vehicle type codes and contributing factor columns are highly correlated. Features with complete data are not shown in the figure.